Annual Compensation of Respondents

Objective: create a histogram of respondents' annual compensation, grouped into bins.

To arrive at the final histogram, we need to create the bin intervals and their labels, and then bin the data.

Notebook setup

Here we import the necessary modules.

And load the CSV.

We can keep only the columns needed for this notebook. Here, that is just the annual compensation column (USD).
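As a rough sketch of this step, the snippet below reads only the compensation column from a CSV. The column names and the inline sample data are hypothetical stand-ins, since the actual file and its headers are not shown here.

```python
import io

import pandas as pd

# Hypothetical stand-in for the survey CSV; the real file and headers may differ.
csv_text = """respondent_id,age,compensation_usd
1,29,55000
2,41,98500
3,35,12000
"""

# usecols restricts loading to the columns this notebook needs.
df = pd.read_csv(io.StringIO(csv_text), usecols=["compensation_usd"])
```

Selecting columns at load time keeps the DataFrame small; slicing an already loaded DataFrame with df[["compensation_usd"]] would give the same result.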

To create the final histogram, we first need to:

Filter out bad data

Since I walk through the exploratory data analysis in the "Age of Respondents" notebook, this one contains only the filter operation, not the exploration of where to set the limits.
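A minimal sketch of such a filter could look like the following. The column name, the sample values, and both limits are assumptions made for illustration; the actual thresholds used in this notebook are not stated here.

```python
import pandas as pd

# Hypothetical sample data, including a missing value and two implausible entries.
df = pd.DataFrame(
    {"compensation_usd": [0, 12_000, 55_000, 98_500, 250_000, 5_000_000, None]}
)

# Illustrative limits only; the real thresholds are not given in the text.
LOWER_LIMIT = 1_000
UPPER_LIMIT = 300_000

# between() is inclusive on both ends by default and evaluates to False for NaN,
# so missing values are dropped along with the out-of-range ones.
mask = df["compensation_usd"].between(LOWER_LIMIT, UPPER_LIMIT)
filtered = df[mask].reset_index(drop=True)
```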

Create the bins

To create the bins, we'll use list comprehensions, for compact code, and pandas' IntervalIndex.

Let's start by creating the bin labels. In this case they will be almost the same as the bins themselves, except the labels are written in thousands of USD for readability.
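One way to build such labels with a list comprehension is sketched below. The bin width, the upper limit, and the exact label format are assumptions chosen for illustration.

```python
BIN_WIDTH = 10_000  # assumed bin width in USD
MAX_COMP = 50_000   # assumed upper limit in USD

# Each label shows the bin edges in thousands of USD, using "[" for a closed
# edge and ")" for an open one.
labels = [
    f"[{start // 1000}k, {(start + BIN_WIDTH) // 1000}k)"
    for start in range(0, MAX_COMP, BIN_WIDTH)
]
```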

The bin ranges are very similar to the labels, but this second list comprehension creates tuples of integers instead of custom strings in open/closed interval notation. As mentioned before, the actual bin ranges are not converted to thousands; otherwise we'd need an extra transformation of the data.
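The matching comprehension for the ranges might look like this, again with an assumed bin width and upper limit. Note that the edges stay in plain USD so they can be compared directly against the compensation values.

```python
BIN_WIDTH = 10_000  # assumed bin width in USD
MAX_COMP = 50_000   # assumed upper limit in USD

# Each tuple is (lower edge, upper edge) in USD, not in thousands.
bin_ranges = [
    (start, start + BIN_WIDTH) for start in range(0, MAX_COMP, BIN_WIDTH)
]
```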

For more information on IntervalIndex, and from_tuples in particular, please refer to its documentation and, alternatively, a Medium article I wrote to see it in action. In short, it creates proper bin ranges from tuples, here closed on the left.
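As a small illustration, the tuples can be passed to pd.IntervalIndex.from_tuples as shown below. The bin edges are assumed values. Since from_tuples defaults to closed="right", the left-closed behavior has to be requested explicitly.

```python
import pandas as pd

# Assumed bin edges in USD, for illustration only.
bin_ranges = [(0, 10_000), (10_000, 20_000), (20_000, 30_000)]

# closed="left" makes each interval include its lower edge and exclude its upper edge.
intervals = pd.IntervalIndex.from_tuples(bin_ranges, closed="left")

# A value sitting exactly on an edge belongs to the interval that starts there.
edge_in_second_bin = 10_000 in intervals[1]
edge_in_first_bin = 10_000 in intervals[0]
```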

Bin the data

Now we can take that IntervalIndex and use it to put the compensations in their respective bins.

Oh, and we need to store the bins as strings, as Plotly doesn't support IntervalIndex or, rather, the category dtype.
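A sketch of the binning step, under assumed bin edges and a hypothetical column name, could use pd.cut with the IntervalIndex as bins. Because pd.cut ignores its labels argument when bins is an IntervalIndex, this sketch swaps in the readable labels with rename_categories before converting to strings; that is one possible approach, not necessarily the one used in this notebook.

```python
import pandas as pd

BIN_WIDTH = 10_000  # assumed bin width in USD
MAX_COMP = 50_000   # assumed upper limit in USD
starts = range(0, MAX_COMP, BIN_WIDTH)

labels = [f"[{s // 1000}k, {(s + BIN_WIDTH) // 1000}k)" for s in starts]
intervals = pd.IntervalIndex.from_tuples(
    [(s, s + BIN_WIDTH) for s in starts], closed="left"
)

# Hypothetical compensation values in USD.
df = pd.DataFrame({"compensation_usd": [5_000, 10_000, 19_999, 42_000]})

# pd.cut assigns each value to its interval; the result has a category dtype.
binned = pd.cut(df["compensation_usd"], bins=intervals)

# Replace the interval categories with the labels, then store plain strings.
df["bin"] = binned.cat.rename_categories(labels).astype(str)
```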

Plot the histogram